A statistical package for evaluation of hybrid performance in plant breeding via genomic selection

Hybrid breeding employs heterosis, which could potentially improve the yield and quality of a crop. Genomic selection (GS) is a promising approach for the selection of quantitative traits in plant breeding. The main objectives of this study are to (i) propose a GS-based approach to identify potential parental lines and superior hybrid combinations from a breeding population, which is composed of hybrids produced by a half diallel mating design; (ii) develop a software package for users to carry out the proposed approach. An R package, designated EHPGS, was generated to facilitate the employment of the genomic best linear unbiased model considering additive plus dominance marker effects for the hybrid performance evaluation. The R package contains a Bayesian statistical algorithm for calculating genomic estimated breeding value (GEBVs), GEBV-based specific combining ability, general combining ability, mid-parent heterosis, and better-parent heterosis. Three datasets that have been published in literature, including pumpkin (Cucurbita maxima), maize (Zea mays), and wheat (Triticum aestivum L.), were reanalyzed to illustrate the use of EHPGS.


Materials and methods
The genomic selection-based approach. The GBLUP model. The GBLUP model considering additive plus dominance effects can be described as follows: where y is the vector of the phenotypic values; 1 n is the unit vector of length n (here n is the number of phenotypic values); g A is the vector of genotypic values for the additive effects; g D is the vector of genotypic values for the dominance effects; and e is the vector of random errors. It is assumed that g A , g D , and e are mutually independent and follow multivariate normal distributions, denoted by g A ∼ N 0, σ 2 A K A , g D ∼ N 0, σ 2 D K D , and e ∼ N 0, σ 2 e I n . Here, K A = 1 p (X A X T A ) is the genomic relationship matrix for the additive effects, abbreviated as A-GRM; the variance component σ 2 A represents the cumulative variability of additive marker effects, abbreviated as A-VC; K D = 1 p (X D X T D ) is the genomic relationship matrix for the dominance effects, abbreviated as D-GRM; and the variance component σ 2 D represents the cumulative variability of the dominance marker effects, abbreviated as D-VC. For the additive effects, the SNP at each locus is coded as − 1, 0, or 1 for the homozygote of the minor allele, the heterozygote, and the homozygote of the major allele, respectively. For the dominance effects, the marker score is coded as 1 for the heterozygote, and 0 for both homozygotes. Then, X A and X D are the standardized marker score matrices for the additive effects and dominance effects, respectively, and p is the number of the SNP markers.
Estimation for GEBVs and genomic heritability. Let µ be the best linear unbiased estimate (BLUE) for µ , g A be the BLUP for g A , and g D be the BLUP for g D . Then, µ , g A , and g D can be obtained from the Henderson's equations 17 : where A = σ 2 e /σ 2 A and D = σ 2 e /σ 2 D . Here, A and D can be replaced with suitable estimates for σ 2 e , σ 2 A , and σ 2 D , respectively denoted by σ 2 e , σ 2 A , and σ 2 D . The estimate for genomic heritability was then obtained as: In this study, the breeding population was composed of all possible hybrid combinations in a half diallel mating design. Let K

(bp)
A and K (bp) D respectively denote the A-GRM and D-GRM between the breeding population and the training population. Moreover, let g (bp) A and g (bp) D denote the BLUPs for the breeding population of additive and dominance effects, respectively. From the article 18 , g (bp) A and g (bp) D can be obtained as: and The genomic estimated genotypic values for the individuals in the breeding population were then predicted by: where N 1 is the number of hybrid combinations in the breeding population. Here, N 1 = C N 0 2 with N 0 as the number of parental lines.
Estimation for GCA, SCA, MPH, and BPH. Let GCA i and GCA j separately denote the GCAs for the parental lines P i and P j , and SCA ij denote the SCA for their hybrid combination P i ⨂ P j . Moreover, let g (ij) A and g (ij) D denote the BLUPs for P i ⨂ P j of additive and dominance effects, respectively. From the article 19 , and (1)  15 , the GEBV-based MPH and BPH for P i ⨂ P j can be estimated by: and where GCA i − GCA j is the absolute value of ( GCA i − GCA j ). Under the positive heterosis assumption, the value of MPH or BPH is larger, and the heterosis of the hybrid combination is stronger.
The Bayesian statistical algorithm. For a given training population with known phenotype and genotype data, a Bayesian Gibbs sampling (BGS) algorithm, modified from an algorithm presented in the article 20 , was used to estimate the required parameters. The algorithm can be described as follows.
• Step 1: Set initial values for the parameters in the model.

•
Step 2: Rewrite Eq. (2) as Here, C i,−i denotes C i,j for all j = i ; and g −i is g j for all j = i.

•
Step 3: Calculate the vector of residuals as: e = y − g 1 − g 2 − g 3 . • Step 4: Update σ 2 e as σ 2 e = (e T e + S * v * )/χ 2 n+v * , where χ 2 n+v * is the chi-square random variate with n + v * degrees of freedom; S * = 0.5V with V as the sample variance of the values in y ; and v * = 5.
. www.nature.com/scientificreports/ An R package called as EHPGS generated for executing the proposed approach is available from GitHub (https:// github. com/ spcsp in/ EHPGS). A referenced manual and a tutorial including a demonstration example are provided in the package.
A comparison study. The pumpkin dataset was analyzed using a two-stage approach in the article 15 , in which the authors first estimated GEBVs, SCAs, GCAs, MPHs, and BPHs based on a whole genome regression model using Bayes C estimation in the R package BGLR 21 . Then, they calculated A-GRM and D-GRM by the two different formulas 22,23 . The restricted maximum likelihood estimation (REML) method was performed for estimating the variance components by using another R package sommer 24 . A comparison of the results obtained from the two-stage approach and ours was discussed in the next section.
The Bayesian reproducing kernel Hilbert space (RKHS) method in BGLR is another Bayesian algorithm that has been commonly used to perform GEBV prediction for the GBLUP model in Eq. (1). To compare the use of the Bayesian RKHS method with our proposed BGS algorithm, the three datasets was reanalyzed by using BGLR. The priors specified in BGLR were the same as ours, the number of iterations was set to 10,000, the number of burn-in was fixed at 9000, and the number of chains was set to five (the BGLR function was repeatedly run five times). These settings are exactly the same as our algorithm in analyzing the datasets.
A simulation study. To further examine whether the proposed BGS algorithm can more accurately estimate known variance components compared to established methods, such as the REML method in sommer, and the Bayesian RKHS method in BGLR, a simulation study was conducted as follows. The estimated values for the model parameters obtained from the training data (displayed in Table 3) were used to generate 3000 sets of phenotype data for the training population in each dataset (119, 276, and 600 realized observations in each simulated dataset for the pumpkin, maize, and wheat datasets, respectively). For a stimulated dataset, the variance components were estimated by the REML, Bayesian RKHS, and our BGS methods.

A cross-validation analysis.
A tenfold cross-validation analysis using empirical data was also performed to compare the accuracy on GEBV prediction among the three methods. There were 119 and 276 empirical observations available in the pumpkin and maize datasets, respectively. For the sake of computational cost saving, 500 individuals randomly selected from the 2556 available hybrids in the wheat dataset were used for this analysis. The procedure can be described as follows.
Step 1: Each of the three datasets was partitioned into 10 exclusive clusters at random. Step 2: During the cross-validation process, each of the 10 clusters was progressively and alternately used as the testing set. At the same time, the remaining nine clusters were pooled as the training set.
Step 3: After the GEBV prediction by each method, Pearson's correlation between GEBVs and phenotypic values in the testing set was calculated for each dataset. Here, the procedure was repeated five times to generate 50 correlation coefficients for each dataset.
The genome datasets. Three datasets that have been published in literature were reanalyzed to illustrate the use of EHPGS.
Pumpkin dataset. A pumpkin dataset which contained 119 intra-crossing hybrid combinations of C. maxima with phenotypic values for fruit weight (FWT) (kg) was analyzed for evaluation of hybrid performance 15 . The phenotype data were historical data collected from 1988 to 2016. All the trials were conducted at a single location experiment in southern area of Taiwan. Every hybrid had six to ten observations at each time point, and the average of them was used as the phenotypic observation for the hybrid of the year. Because the phenotypic values of every hybrid were observed for more than one year, the different year effects were therefore removed based on the assumption that they were random effects following a normal distribution.
The germplasm collection of the pumpkin set consisted of 320 parental lines, which were classified into three clusters: C. maxima with 142 inbred lines, C. pepo with 60 inbred lines and C. moschata with 118 inbred lines. After SNP calling, 76,815 SNPs were extracted from the parental lines, and only 4,521 SNPs remaining for C. maxima after the filtering by missing rate ≥ 0.05, minor allele frequency (MAF) < 0.05, and a series of operations for determining linkage disequilibrium (LD) blocks. The 142 inbred lines produced C 142 2 = 10, 011 potential hybrid combinations in a half diallel mating design. The means adjusted from the year effects for the 119 C. maxima hybrids were used in the current study to build a GBLUP model for evaluating the performance of the 10,011 hybrid combinations.
Maize dataset. A maize dataset was analyzed to study the optimal designs for GS in hybrid crops, which consisted of 276 hybrids derived from 24 parental lines in a half diallel mating design 2 . The 24 diverse parents were classified into two groups according to the germplasm origin and a principal component analysis. The two groups were (i) the temperate and mixed (TM) group, consisting of 11 inbred lines (i.e., B73, B97, Ky21, M162W, Mo17, MS71, Oh43, OH7B, M37W, Mo18W, and Tx303); and (ii) the tropical and sub-tropical (TS) group consisting of the remaining 13 inbred lines (i.e., CML52, CML69, CML103, CML228, CML247, CML277, CML322, CML333, Ki3, Ki11, NC350, NC358, and Tzi8). There were C 11 2 = 55 hybrid combinations in the TM group, C 13 2 = 78 hybrids in the TS group, and 11 × 13 = 143 hybrids between the two groups. Three trait values, flowering time, ear height, and grain yield (YLD) (Mg/ha), were evaluated for all of the hybrids at two locations (i.e., Columbia, MO and Clayton, NC) in 2005 and 2006. In our study, the combined BLUP values from the two locations for YLD were evaluated. www.nature.com/scientificreports/ Genotype data for the 24 inbred lines were extracted from the Maize HapMap V2 25 at www. panzea. org, which consisted of 10,296,310 SNP markers. The SNP markers were first filtered by missing rate ≥ 0.05 and MAF ≤ 0.1, resulting in 134,726 SNPs remaining. Missing genotypes were then imputed with the homozygote of the major allele. To screen out reliable SNPs for building a GBLUP model, the retained SNPs were further filtered by LD blocks. The LD parameter r 2 (i.e., the squared Pearson's correlation coefficient) of the SNPs for each chromosome was estimated using TASSEL5.2.41 26 with a sliding window = 10. A smooth function between r 2 and the physical distance (bp) was built using an R function loess.smooth( ) with a second-degree locally weighted polynomial regression. The LD decay of ten chromosomes is displayed in Fig. S1 of the Supplementary Materials. Filtering the 134,726 SNP markers by the LD block sizes if r 2 approached 0.2, resulting in 46,134 SNPs remaining. A SNP was also deleted if its corresponding column for the dominance effects was a zero vector. Finally, 30,239 SNP markers were retained for further analysis. In the current study, all 276 hybrids with known trait values were used as the training population for the prediction model construction.
Wheat dataset. A genome-based establishment of a high-yielding heterotic pattern for hybrid wheat breeding was investigated, and the study was based on 135 advanced elite winter wheat lines 27 . A set of 1604 wheat hybrids produced from crosses among the 15 male lines and 120 female lines were then evaluated for grain yield (YLD) (Mg/ha) in 11 environments. Grain yield data for all C 135 2 = 9045 unique hybrids were predicted based on those of the phenotyped individuals. For the genotype data, the 135 lines were fingerprinted by using a 90,000 SNP array based on an Illumina Infinium array. After quality tests, 17,372 high-quality SNP markers were retained.
To study optimal designs for GS, 2556 hybrid combinations, produced by the half diallel mating design on 72 lines selected from the original 135 elite wheat lines, were analyzed in the article 2 . An optimal training population with 600 individuals, determined by the r-score criterion 28 , was used in the current study to build the GBLUP model for the performance evaluation on the 2556 hybrid combinations.

Results and Discussion
Pumpkin dataset. By the half diallel mating design, the 142 parental lines produced C 142 2 = 10, 011 hybrid combinations in the breeding population. For illustration purposes, we only reported the top 25 superior hybrid combinations with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 1; and the top 10 potential parental lines with the largest GCAs in Table 2. Table 1 illustrates the important finding that both MPH ij and BPH ij are greater than 0 for all of the selected hybrids, showing that they had better performance in Table 1. The top 25 superior hybrid combinations with the largest GEBVs for fruit weight (FWT) within a pumpkin population. Note that GEBV ij is the genomic estimated breeding value; SCA ij is the specific combining ability; MPH ij is the mid-parent heterosis; BPH ij is the better-parent heterosis for hybrid P i ⨂ P j . www.nature.com/scientificreports/ FWT than both of their parents. More interestingly, every superior hybrid presented in Table 1 was derived from one or two of the potential parental lines presented in Table 2. Particularly, P026, the parental line with the highest GCA, involved the top 11 hybrids with the greatest GEBVs among the 25 selected hybrids. The estimates for the variance components and genomic heritability are shown in Table 3. From the table, the estimates of the A-VC, D-VC, and genomic heritability are given by σ 2 A = 0.306, σ 2 D = 0.159 , and h 2 = 0.807 . The high heritability explains why the values of MPH ij and BPH ij in Table 1 Table 4; and the top 5 potential parental lines with the largest GCAs in Table 5.  The estimates for the variance components and genomic heritability are also displayed in Table 3. From the table, the estimates of the estimates of the A-VC, D-VC, and genomic heritability are given by σ 2 A = 0.434, σ 2 D = 0.420 , and h 2 = 0.415 , partially explaining why the values of MPH ij and BPH ij in Table 4 are all positive, and showing that there is an obvious heterosis in YLD within the breeding population.
Wheat dataset. By the half diallel mating design, the 72 parental lines produced C 72 2 = 2556 hybrid combinations in the breeding population. For illustration purposes, we only reported the top 20 superior hybrids with the largest GEBVs, together with their SCAs, MPHs, and BPHs in Table 6; and the top 10 potential parental lines with the largest GCAs in Table 7. The estimates for the variance components and genomic heritability are also displayed in Table 3. From Table 6, all of the MPH ij are greater than 0, showing that they had a larger YLD than the mean YLD of their parents. Most of the BPH ij are noticeably smaller than MPH ij , probably because the additive effects ( σ 2 A = 0.066, Table 3) were stronger than the dominance effects ( σ 2 D = 0.014 , Table 3). Moreover, 11 of the 20 BPHs are negative, showing that the corresponding hybrids were inferior to their better-parents. Every superior hybrid presented in Table 6 was derived from one or two of the potential parental lines presented Table 5. The top five potential parental lines with the largest GCAs for grain yield (GYD) within a maize population. GCA i is the general combining ability for parental line P i . TS the tropical and subtropical group, TM the temperate and mixed group.  Table 6. The top 20 superior hybrid combinations with the largest GEBVs for grain yield (GYD) within a wheat population. Note that GEBV ij is the genomic estimated breeding value; SCA ij is the specific combining ability; MPH ij is the mid-parent heterosis; BPH ij is the better-parent heterosis for hybrid P i ⨂ P j . www.nature.com/scientificreports/ in Table 7. Particularly, F102, the parental line with the highest GCA (Table 7), involved 17 of the 20 selected superior hybrids (Table 6). In summary, the BPH values were consistently positive for the top hybrids in both the pumpkin and maize datasets, implying that there exists a strong and useful heterosis in the two crops. The valuable result can also be found in literature 15,29 . However, only a few of the top hybrids had a positive but too small BPH value in the case of wheat, indicating that the heterosis existing in this dataset may not be adequate for practical utility. A wheat hybrid has a small positive or negative BPH value because one of its parents is inferior 30 . The correlation between phenotypic values and GEBVs. Scatter plots of all available phenotypic values (119, 276, and 2556 individuals in the pumpkin, maize, and wheat datasets, respectively) and their GEBVs in each dataset are displayed in Figs. 1, 2 and 3. The respective Pearson's correlation coefficients are 0.9691, 0.6786, and 0.9445. From the figures, most of the selected superior hybrids appeared in the upper right-hand corners, meaning that the selected hybrids with higher GEBVs also have higher actual phenotypic values. This is a valuable result because phenotypic selection is usually costly and time-consuming for selective breeding. The great consistency exists between the results of genomic selection and phenotypic selection, supporting that the proposed GS-based approach can be recommended for practical applications. Table 7. The top 10 potential parental lines with the largest GCAs for grain yield (GYD) within a wheat population. GCA i is the general combining ability for the parental line P i . www.nature.com/scientificreports/ The results of the comparison study. The top 25 superior hybrids identified by the two-stage approach 15 , together with those identified by our proposed approach are displayed in Table S1 of the Supplementary Materials. The corresponding identified 10 potential parental lines are displayed in Table S2. Both sets of the results are highly consistent with each other. From Table S1, 18 hybrids were in common among the 25 hybrids selected by each approach, and the top six hybrids with the highest GEBVs were the same, even though the order was slightly different. Table S2 indicates seven potential parental lines in common among the 10 selected by each approach. The variance components for additive, dominance, random error effects, and genomic heritability   Overall, the results of the identified top parental lines and hybrid combinations between the Bayesian RKHS method in BGLR and our BGS algorithm were highly consistent with each other. Pearson's correlations between GEBVs and phenotypic values for the datasets are displayed in Table 8. From which, our proposed algorithm led to higher Pearson's correlations in the pumpkin and three maize datasets, but almost equal in the wheat dataset. Additionally, the estimates for variance components and genomic heritability by using the Bayesian RKHS method are displayed in Table 9. In comparison with those obtained from our BGS algorithm (Table 3), BGLR resulted in relatively low genomic heritability.
The results of the simulation study and the cross-validation analysis. Side-by-side box-plots for the estimates of the variance components over the 3000 repetitions in the simulation study are displayed in Fig. 4. From the figure, the two Bayesian methods of BGS algorithm and the Bayesian RKHS method generally led to larger bias but smaller dispersion than the REML method in the estimation. The performance of the methods might be dependent on different dataset-variance-component combinations. For example, BGS algorithm tended to overestimate σ 2 A , but the Bayesian RKHS method was likely to underestimate it in the pumpkin dataset. Moreover, BGS algorithm had slightly better performance in σ 2 e , but worse in σ 2 D than the Bayesian RKHS method in the dataset.
The mean and the standard deviation over the 50 resulting values in the cross-validation analysis are displayed in Table 10. From the table, the three methods had quite close performance in the three datasets. BGS algorithm, the REML method, and the Bayesian RKHS method outperformed the others in the maize, wheat, and pumpkin datasets, respectively. However, the margins were very small. According to the above results, the REML method in sommer and the Bayesian RKHS method in BGLR were also imported in EHPGS as options for the GEBV prediction and variance component estimation.

Conclusion
In this study, a software package called EHPGS was generated for identifying potential parental lines and superior hybrid combinations from a breeding population, which is composed of all possible hybrids produced by a half diallel mating design. A training population with known phenotype and genotype data is required to build the GBLUP model, and then a set of parental lines with known genotype data is also required to perform GEBV prediction for its derived hybrid combinations. Any dataset with such training population and parental line set can fit the package. For an input dataset, EHPGS generates GEBVs, SCAs, GCAs, MPHs, and BPHs for all potential candidates to achieve the task. Table 8. Pearson's correlations between GEBVs and phenotypic values for the datasets obtained from Bayesian RKHS method in BGLR and our proposed BGS algorithm. 1 Maize-A: the combined data from the two locations; Maize-B: the data from Columbia, MO; Maize-C: the data from Clayton, NC.